library(readxl)
1 Original Data Visualization in News Media
Urban air quality is a pressing concern that ties together public health, environmental policy, and urban planning. Our visualization, inspired by the data presented by Visual Capitalist (2022), examines air quality levels across major global cities and their implications for urban air pollution. The project aims to dissect the relationship between air quality indices and urbanization patterns, and to surface trends that may correlate with countries’ income levels. The visualization covers cities worldwide and supports a comparative analysis that highlights both improvements and deteriorations in air quality. While the original visualization communicates these trends effectively, several enhancements could improve its clarity and depth. Interactive maps, detailed temporal breakdowns, and demographic overlays would allow a more nuanced exploration of how urban development strategies affect air quality, and would help readers see the trends in air quality across global cities.
Figure 1: Visualized: Air Quality and Pollution in 50 Capital Cities (IQAir 2022 World Air Quality Report)
2 Critical Assessment of the Original Visualization
The original visualization uses red circles to represent PM2.5 concentrations, which makes the relative pollution level of each capital city easy to see. Each city is clearly labeled, so readers can tell directly which data point belongs to which city. Color and circle size indicate levels exceeding the WHO safe limit, drawing attention to areas with severe pollution. Quantitative clarity is also strong: circle sizes correspond to specific PM2.5 concentration ranges, giving an immediate visual sense of pollution severity. The visualization is not interactive, but adding interactivity could improve engagement and depth by letting users retrieve detailed data. Overall, the original visualization communicates the general trends in PM2.5 pollution across capital cities well. However, we identified several shortcomings:
Absence of Grid Lines: Without grid lines, it is difficult to read values precisely and to judge scale, which can cause confusion when comparing cities.
Static Year Selection: The visualization is limited to 2022 data. Including multiple years would provide a more dynamic and comprehensive view of trends over time.
No Regional Differentiation: While each city is labeled, there is no clear regional differentiation which could be useful in understanding broader regional trends and patterns.
Static Presentation: The circles are fixed and do not change based on user input. Dynamic elements, such as bubble sizes or color gradients that adjust over time or based on user-selected parameters, could enhance the visual representation.
Lack of Interaction: There are no interactive elements such as tooltips or toggle buttons that would let users explore the data in more depth or switch between time periods or concentration ranges.
Data Density: In highly polluted areas (e.g., cities with PM2.5 levels above 50 µg/m³), the circles become dense and can overlap, making it harder to distinguish individual data points.
3 Proposed Improvements
Improved color coding: Using distinct colors with corresponding gradients would make the data easier to tell apart (e.g., green for good air quality, yellow for moderate air quality, and red for bad air quality).
Distinction for regional data: Grouping cities by country and countries by region will give a better picture of air quality within each country and region.
Addition of comparison and filtering options: Filters that highlight one or more countries or regions will let users view and compare the data they care about instead of scanning the whole list for a specific data point.
Interactive bubbles: Hovering over a bubble will show the PM2.5 concentration of the country or region it represents.
Expansion on viewable data: Options to view historical data from before 2022 will support trend studies of air quality for each country or region.
Change or increase in ways for data representation: Offer other views of the data, such as a heat map, bar charts, or line graphs. Show only a few data points at a time and change the scale of the graph rather than letting points overlap.
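The color-coding proposal above could be sketched as follows. This is a minimal illustration, not the final implementation: the city names and PM2.5 values are made up, and the cut-offs of 10 and 35 are hypothetical placeholders rather than official air quality categories.

```r
library(ggplot2)

# Toy data: city names and PM2.5 values are invented for illustration.
cities <- data.frame(
  city = c("City A", "City B", "City C", "City D"),
  pm25 = c(4.2, 18.5, 36.0, 89.7)
)

# Hypothetical cut-offs (10 and 35); replace them with the bands you want to show.
cities$band <- cut(
  cities$pm25,
  breaks = c(-Inf, 10, 35, Inf),
  labels = c("Good", "Moderate", "Poor")
)

# One fixed color per band, matching the green / yellow / red proposal.
band_colours <- c(Good = "forestgreen", Moderate = "gold", Poor = "firebrick")

# Build the plot object; print(p) would draw it.
p <- ggplot(cities, aes(x = pm25, y = reorder(city, pm25), colour = band, size = pm25)) +
  geom_point() +
  scale_colour_manual(values = band_colours) +
  labs(x = "PM2.5 (\u00b5g/m\u00b3)", y = NULL, colour = "Air quality", size = "PM2.5")
```

Placing cities on a sorted axis instead of packing circles together is one way to avoid the overlap noted earlier; the discrete bands keep the legend readable even when the values span a wide range.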
4 Data Cleaning
4.1 Data Source Summary
The original data set used for the visualization was sourced from the IQAir 2022 World Air Quality Report. This data, however, does not appear to be publicly available, so we use another dataset from the World Health Organization (WHO), which provides air quality data for various countries. The dataset contains PM2.5, PM10, and NO2 concentrations for different cities, countries, and years. Below are the glimpse() and summary() outputs for the data.
WHO Region          ISO3                WHO Country Name    City or Locality
Length:32191        Length:32191        Length:32191        Length:32191
Class :character    Class :character    Class :character    Class :character
Mode  :character    Mode  :character    Mode  :character    Mode  :character

Measurement Year  PM2.5 (μg/m3)     PM10 (μg/m3)      NO2 (μg/m3)
Min.   :2000      Min.   :  0.01    Min.   :  1.04    Min.   :  0.00
1st Qu.:2014      1st Qu.: 10.35    1st Qu.: 16.98    1st Qu.: 12.00
Median :2016      Median : 16.00    Median : 22.00    Median : 18.80
Mean   :2016      Mean   : 22.92    Mean   : 30.53    Mean   : 20.62
3rd Qu.:2018      3rd Qu.: 31.00    3rd Qu.: 31.30    3rd Qu.: 27.16
Max.   :2021      Max.   :191.90    Max.   :540.00    Max.   :210.68
                  NA's   :17143     NA's   :11082     NA's   :9991

PM25 temporal coverage (%)  PM10 temporal coverage (%)
Min.   :  0.00              Min.   :  2.568
1st Qu.: 88.60              1st Qu.: 87.945
Median : 97.00              Median : 96.039
Mean   : 90.79              Mean   : 90.583
3rd Qu.: 99.00              3rd Qu.: 98.938
Max.   :100.00              Max.   :100.000
NA's   :24916               NA's   :26810

NO2 temporal coverage (%)   Reference
Min.   :  1.923             Length:32191
1st Qu.: 93.208             Class :character
Median : 96.370             Mode  :character
Mean   : 93.697
3rd Qu.: 98.927
Max.   :100.000
NA's   :12301

Number and type of monitoring stations  Version of the database  Status
Length:32191                            Min.   :2016             Mode:logical
Class :character                        1st Qu.:2022             NA's:32191
Mode  :character                        Median :2022
                                        Mean   :2022
                                        3rd Qu.:2022
                                        Max.   :2022
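The missing-value counts reported in the summary can also be tabulated directly, which makes it easier to compare columns. The sketch below assumes a toy data frame named `air`; its column names and values are invented for illustration and are not the WHO data.

```r
# Toy data standing in for the air quality table.
air <- data.frame(
  city = c("A", "B", "C", "D"),
  pm25 = c(10, NA, 30, NA),
  pm10 = c(20, 40, NA, 60)
)

# Number of missing values in each column.
na_count <- colSums(is.na(air))

# Share of rows that are missing, as a percentage per column.
na_share <- round(100 * na_count / nrow(air), 1)
```

Looking at the share rather than the raw count helps when deciding whether dropping rows is acceptable or whether imputation is needed.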
4.2 Handling of Missing Values
Based on the summaries above, the PM2.5, PM10, and NO2 columns and their temporal coverage columns contain missing values; for example, 17143 of the 32191 rows have no PM2.5 value. We need to handle these missing values before proceeding. Some methods we can use include:
Dropping Missing Values: We can drop rows with missing values if they are not significant in number.
Imputation: We can impute missing values with the mean, median, or mode of the column.
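We will impute missing values with the mean of the column. Note that fill() from the tidyr package carries the previous or next observed value forward rather than computing a mean, so mean imputation is better expressed with replace_na() from the same package together with mean(x, na.rm = TRUE).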
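A minimal sketch of mean imputation, assuming toy data with made-up column names `pm25` and `pm10`. It uses replace_na() inside mutate(); a fill() call is included only to show how its result differs.

```r
library(dplyr)
library(tidyr)

# Toy data; column names and values are invented for illustration.
air <- tibble(
  city = c("A", "B", "C", "D"),
  pm25 = c(10, NA, 30, NA),
  pm10 = c(20, 40, NA, 60)
)

# Replace each NA with the mean of the observed values in the same column.
imputed <- air |>
  mutate(across(c(pm25, pm10), ~ replace_na(.x, mean(.x, na.rm = TRUE))))

# For comparison: fill() copies the previous non-missing value downward
# and does not compute a mean.
carried <- air |>
  fill(pm25, pm10, .direction = "down")
```

Mean imputation assigns the same number to every gap in a column, so when many values are missing it reduces the spread of the data. That trade-off is worth keeping in mind when reading the resulting charts.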
We will also need to normalize the column names to ensure consistency and ease of access. This will involve converting all column names to lowercase, replacing spaces with underscores, and removing special characters.
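The three normalization steps can be sketched with base R string functions. The helper name `normalize_names` is hypothetical, and this version strips every character other than lowercase letters, digits, and underscores, which may be stricter than the cleaning code actually used in this report.

```r
# Hypothetical helper: lowercase, spaces to underscores, drop special characters.
normalize_names <- function(x) {
  x <- tolower(x)                               # 1. lowercase
  x <- gsub("[[:space:]]+", "_", x)             # 2. whitespace -> underscore
  x <- gsub("[^a-z0-9_]", "", x, perl = TRUE)   # 3. remove everything else
  x <- gsub("_+", "_", x)                       # collapse repeated underscores
  gsub("^_|_$", "", x)                          # trim leading/trailing underscore
}

raw_names <- c("WHO Region", "City or Locality", "PM25 temporal coverage (%)")
clean_names <- normalize_names(raw_names)
```

On a data frame `df`, the helper would be applied with `names(df) <- normalize_names(names(df))`.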
We will also convert the columns to appropriate data types. For example, the measurement year column should be stored as an integer or date type if it is not already in that format.
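As a small illustration of the type conversion, the sketch below turns numeric years into integers and, alternatively, into dates. The vector `year_raw` is a made-up stand-in for the measurement year column, and using 1 January as the day is an arbitrary assumption.

```r
# Toy values standing in for the measurement year column (stored as doubles).
year_raw <- c(2015, 2016, 2017)

# Integer years are enough for grouping and for discrete axes.
year_int <- as.integer(year_raw)

# A Date is useful for date scales; 1 January is an arbitrary placeholder day.
year_date <- as.Date(paste0(year_int, "-01-01"))
```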
# A tibble: 6 × 15
  who_region            iso3  who_country_name city_or_locality measurement_year
  <chr>                 <chr> <chr>            <chr>                       <int>
1 Eastern Mediterranea… AFG   Afghanistan      Kabul                        2019
2 European Region       ALB   Albania          Durres                       2015
3 European Region       ALB   Albania          Durres                       2016
4 European Region       ALB   Albania          Elbasan                      2015
5 European Region       ALB   Albania          Elbasan                      2016
6 European Region       ALB   Albania          Elbasan                      2017
# ℹ 10 more variables: `pm2.5_(μg/m3)` <dbl>, `pm10_(μg/m3)` <dbl>,
#   `no2_(μg/m3)` <dbl>, `pm25_temporal_coverage_(%)` <dbl>,
#   `pm10_temporal_coverage_(%)` <dbl>, `no2_temporal_coverage_(%)` <dbl>,
#   reference <chr>, number_and_type_of_monitoring_stations <chr>,
#   version_of_the_database <dbl>, status <lgl>
5 Conclusion
The data has now been cleaned and is ready for visualization. We will use ggplot2 to create the visualizations and implement the proposed improvements to enhance clarity and depth, giving users a more interactive and informative experience. With these enhancements, we aim to create an engaging and insightful visualization that communicates the trends in air quality across global cities.